DETAILED ACTION

1. This application has been examined. Claims 1-12 are pending in this application.
Priority 

2. Applicant's claim for the benefit of a prior-filed application under 35 U.S.C. 
119(e) or under 35 U.S.C. 120, 121, or 365(c) is acknowledged. Applicant has not 
complied with one or more conditions for receiving the benefit of an earlier filing date 
under 35 U.S.C. 371 as follows: 

Priority is over 30 months: the US filing date is 8/03/2006, while the PCT application was filed on 04/14/2003.

Information Disclosure Statement 

3. The Examiner has considered the references listed on the Information
Disclosure Statement submitted on 10/12/2005 (see attached PTO-1449).

Drawings 

4. The examiner has found the drawings submitted on 10/12/2005 to be
acceptable for examination proceedings.



Claim Rejections - 35 USC § 102 

5. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 
A person shall be entitled to a patent unless -

(e) the invention was described in (1) an application for patent, published under section 122(b), by 
another filed in the United States before the invention by the applicant for patent or (2) a patent 
granted on an application for patent by another filed in the United States before the invention by the 
applicant for patent, except that an international application filed under the treaty defined in section 
351(a) shall have the effects for purposes of this subsection of an application filed in the United States
only if the international application designated the United States and was published under Article 21(2) 
of such treaty in the English language. 

6. Claims 1-12 are rejected under 35 U.S.C. 102(e) as being anticipated by
Moulton et al. (US 6778252 B2).

Regarding claim 1, Moulton discloses a system (1) for performing automatic
dubbing on an incoming audio-visual stream (2) (FIG. 1 is a block diagram of the audio-
visual dubbing system; see FIG. 1 and the drawing description), said system (1) comprising:
means (3, 7) for identifying the speech content in the audio-visual stream (2); a speech-
to-text converter (13) for converting the speech content into a digital text format (14); a
translating system (15) for translating the digital text (14) into another language or
dialect; a speech synthesizer (19) for synthesizing the translated text (18) into a speech
output (21); and a synchronizing system (9, 12, 22, 23, 26, 31, 33, 34, 35) for
synchronizing the speech output (21) to an outgoing audio-visual stream (28) (there
have been recent developments towards automating the voice dubbing process using
2D based techniques to modify archival footage, using computer vision techniques and
audio speech recognition techniques to identify, analyze and capture visual motions
associated with specific speech utterances. Prior approaches have concentrated on
creating concatenated based synthesis of new visuals to synchronize with new voice
dub tracks from the same or other actors, in the same or other languages. This
approach analyzes screen actor speech to convert it into triphones and/or phonemes
and then uses a time coded phoneme stream to identify corresponding visual facial
motions of the jaw, lips, visible tongue and visible teeth, these single frame snapshots
or multi-frame clips of facial motion corresponding to speech phoneme utterance states
and transformations are stored in a database, which are then subsequently used to
animate the original screen actor's face, synchronized to a new voice track that has
been converted into a time-coded, image frame-indexed phoneme stream; see col. 1,
lines 29-47).

Regarding claim 2, Moulton discloses the system (1), containing a voice profiler
(10) for generating voice profiles (11) for the speech content and for allocating the
appropriate voice profile (11) to the translated text (18) for speech output synthesis (The
voice recognition system generates a time-stamped annotation database of individual
frames and associated computer estimated phonemes and diphones, and the estimated
pure or mixed phoneme combination corresponding to the frame; see col. 1, lines
19-25).

Regarding claim 3, Moulton discloses the system (1), wherein the system (1)
contains a source of time data (4) for the allocation of timing information to the audio
and video contents (4, 5) for later synchronization of these contents (these single frame
snapshots or multi-frame clips of facial motion corresponding to speech phoneme
utterance states and transformations are stored in a database, which are then
subsequently used to animate the original screen actor's face, synchronized to a new
voice track that has been converted into a time-coded, image frame-indexed phoneme
stream; see col. 1, lines 41-47).

Regarding claim 4, Moulton discloses the system (1), wherein the translation
system (15) (there are many cinematic and television works where it is desirable to
have a language translation dub of an original cinematic or dramatic work, where the
original recorded voice track is replaced with a new voice track; a translation system is
therefore inherent; see col. 1, lines 15-19) contains a language database (17) with a
plurality of different languages and/or dialects and means for selection of a language or
dialect from this database (17) into which the digital text (14) is to be translated (the
elicited reference audio-visual database is then supplemented by the original target
screen footage, to be re-synchronized to another language. The computer vision tracks
and estimates the position of the control points mapped to the mouth as they move in
the target production footage; see col. 3, lines 50-55).

Regarding claim 5, Moulton discloses the system (1), wherein the system (1)
contains an open-caption generator (29) for the creation of open captions (30) using the
digital text (14) and/or the translated digital text (18), for inclusion in an outgoing audio-
visual stream (28) (The speech track is time-stamped to frames (FIG. 1, Blocks 130 and
170). Computer voice recognition of the original recorded speech track is executed
(FIG. 1, Blocks 210 and 200). The voice recognition can be additionally aided by working
with a prior known speech text transcript; see col. 7, lines 16-21), also (there are many
cinematic and television works where it is desirable to have a language translation dub
of an original cinematic or dramatic work, where the original recorded voice track is
replaced with a new voice track; a translation system is therefore inherent; see col. 1,
lines 15-19).

Regarding claim 6, Moulton discloses an audio-visual device comprising a system
(1) (FIG. 1 is a block diagram of the audio-visual dubbing system; see FIG. 1 and the
drawing description).

Regarding claim 7, Moulton discloses a method for automatic dubbing of an
incoming audio-visual stream (2) (FIG. 1 is a block diagram of the audio-visual dubbing
system; see FIG. 1 and the drawing description), which method comprises: identifying the
speech content in the audio-visual stream (2) (using computer vision techniques and
audio speech recognition techniques to identify, analyze and capture visual motions
associated with specific speech utterances. Prior approaches have concentrated on
creating concatenated based synthesis of new visuals to synchronize with new voice
dub tracks from the same or other actors; see col. 1, lines 32-37); converting the speech
content into a digital text format (14); translating the digital text (14) into another
language or dialect; converting the translated text (18) into a speech output (21) (the
elicited reference audio-visual database is then supplemented by the original target
screen footage, to be re-synchronized to another language. The computer vision tracks
and estimates the position of the control points mapped to the mouth as they move in
the target production footage; see col. 3, lines 50-55); and synchronizing the speech
output (21) to an outgoing audio-visual stream (28) (The result is such that the image
frames show sequential lip motion that is now visually synchronized to the new dub
speech track; see col. 4, lines 60-62).

Regarding claim 8, Moulton discloses the method, wherein voice profiles (11) for the
speech content are generated and allocated to the appropriate translated text (18) in the
synthesis of speech output (21) (The voice recognition system generates a time-
stamped annotation database of individual frames and associated computer estimated
phonemes and diphones, and the estimated pure or mixed phoneme combination
corresponding to the frame; see col. 1, lines 19-25).

Regarding claim 9, Moulton discloses the method, wherein a copy of the speech
content is diverted from the audio-visual stream (2) or from an audio content of the
audio-visual stream (2) (using computer vision techniques and audio speech
recognition techniques to identify, analyze and capture visual motions associated with
specific speech utterances. Prior approaches have concentrated on creating
concatenated based synthesis of new visuals to synchronize with new voice dub tracks
from the same or other actors; see col. 1, lines 32-37), also (For commonly used speech
transformations, commonly used spine based curve fitting techniques are applied to
estimate and closely match the recorded relative spatial paths and the rate of relative
motions during different transformations. The estimated motion path for any reference
point on the face, in conjunction with all the reference points and rates of relative motion
change during any mouth shape transformation, is saved and indexed for later
production usage; see col. 3, lines 8-16).

Regarding claim 10, Moulton discloses the method, wherein the speech content in
the audio-visual stream (2) is separated from the remaining audio-visual stream or from
a remaining audio content of the audio-visual stream (2) (the automatically
incorporated viseme offsets to control vertices may contain emotional or other non-
speech expressive content and shape. The degree of viseme offset may be given a
separate channel control in a multi-channel mixer approach to animation control. Thus,
any radar measurement motion tracking of the lip position may be separated into
discreet component channels of shape expression; see col. 6, lines 37-44).

Regarding claim 11, Moulton discloses the method, wherein an audio/video
combiner (26) inserts the speech output (21) into the outgoing audio-visual stream (28),
replacing the original speech content (approaches have concentrated on creating
concatenated based synthesis of new visuals to synchronize with new voice dub tracks
from the same or other actors, in the same or other languages. This approach analyzes
screen actor speech to convert it into triphones and/or phonemes and then uses a time
coded phoneme stream to identify corresponding visual facial motions; see col. 1, lines
35-40).

Regarding claim 12, Moulton discloses the method, wherein an audio/video
combiner (26) overlays the speech output (21) into the outgoing audio-visual stream
(28) (Producing composite visemes on the fly. This is accomplished by means of using
continuous dub speaker radar measurement, or applying actor speech facial motion
tracking techniques, or using multi-channel character animation techniques, such as
used in Pixels3D; see col. 5, lines 34-39), also (The viseme CV fixed reference control
points are exactly registered to the original screen actor facial position for the eyes and
nose position, to place them exactly to the head position, scaled to the correct size. The
moving and scaling and positioning actions are done manually using standard 3D
computer graphics overlay and compositing tools; see col. 4, lines 19-25).

Conclusion 

The prior art made of record and not relied upon is considered pertinent to applicant's 
disclosure. 

Cheiky et al. (US 6919892 B1) discloses a photo realistic talking head creation system
and method.

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to KHALID ABDALLA whose telephone number is
(571) 270-7526. The examiner can normally be reached on MONDAY THROUGH
FRIDAY 7 AM TO 5 PM.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, JINHEE LEE can be reached on 571-272-1977. The fax phone number for 
the organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 
/K. A./

Examiner, Art Unit 4173 

/Jinhee J Lee/ 

Supervisory Patent Examiner, Art Unit 4173



